Visualization (Exploring Co-variation)

Peter Ganong and Maggie Shi

January 27, 2026

Introduction

Skills hopefully acquired by the end of the lecture

Take two variables in a dataset and visualize them to learn how they co-vary.

Key cases of interest:

  • Categorical variable and a continuous variable
  • Two categorical variables
  • Two continuous variables

Categorical variable and continuous variable

Categorical vs. continuous: roadmap

  • penguins dataset
  • Boxplots
  • Densities
  • Small multiples

penguins dataset

import altair as alt
import pandas as pd
# Palmer penguins data, read from a CSV hosted on GitHub
url = "https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/master/palmerpenguins/data/penguins.csv"
penguins = pd.read_csv(url)
penguins.head()
   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

penguins dataset

species appears to be a categorical variable

penguins['species'].value_counts()
species
Adelie       152
Gentoo       124
Chinstrap     68
Name: count, dtype: int64

Discussion question: is it a Nominal or Ordinal variable?

Categorical & continuous: box plot

mark_boxplot()

alt.Chart(penguins).mark_boxplot().encode(
    alt.X('species:N', title="Species"), 
    alt.Y('body_mass_g:Q', title="Body Mass (g)"),
)

Discussion question: what is the headline message from this graph? Submessages?
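
To read a box plot it helps to know which numbers it draws. A minimal sketch, using made-up toy values rather than the real penguins measurements: the box spans the 25th to 75th percentile and the line inside it marks the median.

```python
import pandas as pd

# Hypothetical toy data (NOT the real penguins measurements)
toy = pd.DataFrame({
    "species": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "body_mass_g": [3000, 3200, 3400, 3600, 4800, 5000, 5200, 5400],
})

# One row per species, one column per quantile:
# the box covers 0.25 to 0.75, the median line sits at 0.5
quartiles = (
    toy.groupby("species")["body_mass_g"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Running the same groupby on penguins gives the numbers behind each box in the chart above.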

transform_density()

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g2', 'density']
).mark_line().encode(
    alt.X('body_mass_g2:Q', title="Body Mass (g)"),
    alt.Y('density:Q', title="Density"),
    alt.Color('species:N', title="Species")
)

transform_density(), scale to 0

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q', scale=alt.Scale(zero=True), title="Body Mass (g)"),
    alt.Y('density:Q', title="Density"),
    alt.Color('species:N', title="Species")
)

Discussion question: what if we required the x-axis range to include zero? Would that improve or reduce clarity? Why?

Boxplot or density plots?

Discussion question: what messages come through more with the box plot? Through the density plot?

alt.Row: small multiples

alt.Chart(penguins).transform_density(
    'body_mass_g',
    groupby=['species'],
    as_=['body_mass_g', 'density']
).mark_line().encode(
    alt.X('body_mass_g:Q', title = "Body Mass (g)"),
    alt.Y('density:Q', title = "Density"),
    alt.Row('species:N', header=alt.Header(labelAngle=0), title = "Species") 
)

By year: colors or small multiples?

Discussion question: these two graphs show identical information. Which do you prefer, and why?

Colors or small multiples?

Two Categorical Variables

Two categorical variables: roadmap

  • Two ways to encode frequency as a third dimension: diamonds
    • size
    • color
  • A word of caution against 3D graphs

A word of caution: 3D graphs

You may have seen covariation between two variables depicted as a 3D plot before

  • 3D graphs are almost never recommended – they distort perception and cannot represent scale accurately
  • altair does not create 3D graphs – for good reason!

Two Categorical Variables: summary

  • Encode frequency as color or size
  • Avoid 3D representations!
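
Whether frequency is drawn as size or as color, the chart displays one count per pair of categories. A quick way to check those counts as a table, sketched with pd.crosstab on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; 'cut' and 'color' are assumed column names
toy = pd.DataFrame({
    "cut":   ["Ideal", "Ideal", "Ideal", "Good", "Good", "Fair"],
    "color": ["E", "E", "G", "E", "G", "G"],
})

# Rows are cut, columns are color, cells are the number of observations
counts = pd.crosstab(toy["cut"], toy["color"])
print(counts)
```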

Two Continuous Variables

Two continuous variables: roadmap

  • movie ratings from Rotten Tomatoes and IMDB
  • diamonds: carat vs price

use alt.Size('count()')

alt.Chart(movies_url).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
)

use alt.Color('count()')

alt.Chart(movies_url).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20), title = "Rotten Tomatoes Rating (%)"),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20), title = "IMDB Rating"),
    alt.Color('count()', title = "Count")
) 

Discussion question

Compare the size-based and color-based 2D histograms above. Which encoding do you prefer? Why?
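
Both charts are 2D histograms: each axis is cut into bins and every cell shows how many observations fall in it. A sketch of that counting step with np.histogram2d, using made-up ratings and only two bins per axis to keep the output small:

```python
import numpy as np

# Hypothetical ratings for six movies (values are made up)
rt = np.array([10, 15, 55, 60, 90, 95])           # Rotten Tomatoes, 0 to 100
imdb = np.array([2.0, 2.5, 5.5, 6.0, 8.5, 9.0])   # IMDB, 0 to 10

# Two bins per axis: RT edges at 0/50/100, IMDB edges at 0/5/10
counts, rt_edges, imdb_edges = np.histogram2d(
    rt, imdb, bins=[2, 2], range=[[0, 100], [0, 10]]
)

# counts[i, j] = number of movies in RT bin i and IMDB bin j
print(counts)
```

The altair charts do the same counting with up to 20 bins per axis, then map each cell's count to circle size or to fill color.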

Exploring covariation: summary

Scenario                              Functions
Categorical and continuous variable   mark_boxplot()
                                      transform_density()
                                      alt.Row()
Two categorical variables             size
                                      color
Two continuous variables              alt.Size('count()')
                                      alt.Color('count()')
                                      mark_boxplot()
                                      binscatter